Evaluating Natural Language Systems


Recent years have seen a proliferation of computer systems for natural
language processing (NLP). These include front ends to databases, expert
systems and tutoring systems. Such systems generally come with a list of
inputs (typically single sentences) that the system is claimed to ‘handle’.
The problem in judging these systems is that it is very difficult to tell
from the examples just what claims are being made. If one of the examples
includes an ellipsis, does that mean the system handles ellipsis in general?
Or only certain kinds? What is ellipsis ‘in general’? Are there different
kinds of ellipsis that require different kinds of understanding?

Evaluating these claims requires that we know what inputs the system should
handle and what it would mean to understand the input. Testing understanding
is easier for applied systems since there is generally a specific task
involved, e.g., accessing a database. But deciding what inputs should be
handled is more difficult because there is no general agreement on what
kinds of linguistic phenomena there are. Without a common classification of
the problems in natural language understanding, authors have no way to
specify clearly what their systems do, potential users have no way to
compare different systems, and researchers have no way to judge the
advantages or disadvantages of different approaches to developing NLP
systems.

This paper reports progress in the development of evaluation methodologies
for natural language systems. This work is part of the Artificial
Intelligence Measurement System (AIMS) project of the Center for the Study
of Evaluation at UCLA.


Previous Work

These problems have been discussed for some time in computer science NLP
work, but there has been very little work on developing actual evaluative
criteria. Woods (1977) discussed the taxonomic approach and pointed out some
of its strengths and weaknesses. Guida and Mauri (1984, 1986) discuss a
formal model which involves measuring the correctness of the understanding
and averaging it over a weighted set of inputs. But this method assumes
that we can describe a weighting for (categories of) inputs.
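
The idea of averaging correctness over a weighted set of inputs can be
sketched in a few lines. This is only an illustration of a weighted mean,
not Guida and Mauri's formal model: the function name, the [0, 1]
correctness scale and the sample weights are all assumptions made for the
example.

```python
from typing import Iterable, Tuple


def weighted_understanding_score(results: Iterable[Tuple[float, float]]) -> float:
    """Weighted mean of correctness scores.

    Each pair is (correctness, weight). Correctness is assumed to lie in
    [0, 1]; the weight says how much that input (or category) matters.
    """
    pairs = list(results)
    total_weight = sum(weight for _, weight in pairs)
    if total_weight <= 0:
        raise ValueError("at least one input must have a positive weight")
    return sum(correct * weight for correct, weight in pairs) / total_weight


# Hypothetical results: (correctness, weight) for three test inputs.
results = [(1.0, 3.0), (0.5, 1.0), (0.0, 1.0)]
score = weighted_understanding_score(results)
```

The sketch makes the dependence on the weights explicit: the same three
correctness values give a different score under a different weighting, which
is why the method needs an agreed weighting before systems can be compared.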


The Sourcebook


In developing evaluative criteria for NLP systems we had several goals in
mind. First, the criteria used should be applicable over the broadest
possible range of systems and still provide comparability of the systems.
Second, a system shouldn’t just be rated on a pass/fail count. The
evaluation should outline areas of competence so that implementors can see
where further work is needed in their system. They should be able to say
“this approach handles types 1, 2 and 3 of ellipsis but not types 4 and 5
yet” rather than “this approach handles ellipsis”. Third, the criteria used
should be comprehensible to the general user and to researchers outside
computational linguistics. We need to present the issues in such a way that
the user can make judgments about the importance of different components of
the evaluation. This means presenting the issues in terms of the general
principles involved and giving concrete examples. This approach also allows
us to bring in information from areas like education, psychology, sociology,
law and literary analysis and enables researchers in those areas to
contribute to the evaluation.
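
The second goal, reporting areas of competence instead of a single pass/fail
count, could look something like the following. The phenomenon and subtype
labels and the outcomes are invented for illustration; the paper does not
define a numbered typology of ellipsis.

```python
from collections import defaultdict


def competence_profile(outcomes):
    """Tally (passed, total) per (phenomenon, subtype) pair."""
    tally = defaultdict(lambda: [0, 0])
    for phenomenon, subtype, passed in outcomes:
        tally[(phenomenon, subtype)][0] += int(passed)
        tally[(phenomenon, subtype)][1] += 1
    return {key: (passed, total) for key, (passed, total) in tally.items()}


# Hypothetical test outcomes: (phenomenon, subtype, passed?)
outcomes = [
    ("ellipsis", "type 1", True),
    ("ellipsis", "type 1", True),
    ("ellipsis", "type 2", True),
    ("ellipsis", "type 4", False),
]
profile = competence_profile(outcomes)
```

A flat count over these outcomes would say only "3 of 4 passed"; the profile
shows that the failure is confined to one subtype.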

To this end, we are building a database of exemplars of representative
problems in natural language understanding, mostly from the computational
linguistics literature. Each exemplar includes a piece of text (sentence,
dialogue fragment, etc.), a description of the conceptual issue represented,
a detailed discussion of the problems in understanding the text and a
reference to a more extensive discussion in the literature. The Sourcebook
consists of a large set of these exemplars and a conceptual taxonomy of the
types of issues represented in the database. The exemplars are indexed by
source in the literature and by conceptual class of the issue so that the
user can readily access the relevant examples. The Sourcebook provides a
structured representation of the coverage that can be expected of a natural
language system.
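
One way to picture an exemplar and its two indexes is sketched below. The
class and field names are assumptions made for this example, as is the
choice to store the conceptual class as a path in the taxonomy; the paper
does not describe how the database is actually implemented. The sample
entry is taken from the exemplar shown at the end of this paper.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Exemplar:
    text: str                # sentence, dialogue fragment, etc.
    topic: Tuple[str, ...]   # path in the conceptual taxonomy
    discussion: str          # problems in understanding the text
    reference: str           # pointer to the literature


class Sourcebook:
    def __init__(self) -> None:
        self.exemplars: List[Exemplar] = []
        self._by_reference: Dict[str, List[Exemplar]] = defaultdict(list)

    def add(self, exemplar: Exemplar) -> None:
        self.exemplars.append(exemplar)
        self._by_reference[exemplar.reference].append(exemplar)

    def by_reference(self, reference: str) -> List[Exemplar]:
        """Look up exemplars by their source in the literature."""
        return list(self._by_reference.get(reference, []))

    def by_topic(self, *prefix: str) -> List[Exemplar]:
        """Return exemplars filed at or below the given taxonomy node."""
        return [e for e in self.exemplars if e.topic[:len(prefix)] == prefix]


book = Sourcebook()
book.add(Exemplar(
    text="The next day after we sold our car, the buyer returned "
         "and wanted his money back.",
    topic=("Anaphoric reference", "roles"),
    discussion="'The buyer' refers to a role in the selling event.",
    reference="Allen (1987), p. 346",
))
```

Calling book.by_topic("Anaphoric reference") returns everything under that
class, while book.by_topic("Anaphoric reference", "roles") narrows the
lookup to one subclass.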

Rather than start with a particular theory of language, we began with a
search of the computational linguistics literature. While no one would claim
that computational linguistics has discovered, let alone solved, every
problem in language use, twenty-five years of research has covered a broad
range of problems. Looking at language use computationally focuses attention
on phenomena that are often neglected in more theoretical analyses. Building
systems intended to read real text or interact with real users raises
complex problems of interaction of linguistic phenomena. The exemplars are
mostly taken from the literature, although we have added examples to fill in
gaps where we felt the published examples were incomplete. Because many of
the published cases involved particular systems, the examples are often
discussed in the literature in relation to that system. In the exemplars, we
analyze the example in terms of the general issue represented. Then the
exemplars are grouped into categories of related problems. This will
generate the hierarchical classification of the issues.


Continuing and Future Work

We have several hundred exemplars and we estimate that we have covered
10 per cent of the relevant literature (journals, proceedings volumes,
dissertations, major textbooks) in computational linguistics, artificial
intelligence and cognitive science.

We are continuing to add exemplars to the Sourcebook and are elaborating
the classification scheme. We will be making the Sourcebook available to
other researchers for comment and analysis.


A Sample Exemplar

(1) The next day after we sold our car, the buyer returned and wanted his
money back. (Allen, 1987, p. 346)

(2) The day after we sold our house, the escrow company went bankrupt.

(3) The day after we sold our house, they put in a traffic light at the corner.


Topic 

Anaphoric reference - roles.


Discussion 

In (1) the ‘buyer’ refers back to a figure in one of the roles in the
‘selling a car’ event. The system must search not only the direct possible
antecedents (the ‘selling’) but must also consider aspects of the selling to
resolve the reference. In (1), there is nothing specific to ‘car’ about
resolving the reference. But in (2), finding the reference of ‘the escrow
company’ involves looking past the general “buying” script and searching
through aspects of selling specific to selling houses. There is a general
problem here with controlling the amount of search while still looking deep
enough. In (3), the system has to go from the house to the location to the
street to the corner to understand the reference.
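
The trade-off between limiting search and looking deep enough can be made
concrete with a depth-bounded breadth-first search. The association graph
below is hand-built for the three examples and is purely hypothetical; it
stands in for whatever scripts or knowledge base a real system would use,
and the node names are assumptions.

```python
from collections import deque

# Hypothetical associations: each concept points to its roles or to
# related concepts. "sale" plays the part of the general script.
ASSOCIATIONS = {
    "sell car": ["sale", "car"],
    "sell house": ["sale", "house", "escrow company"],
    "sale": ["seller", "buyer", "money"],
    "house": ["location"],
    "location": ["street"],
    "street": ["corner"],
}


def find_link(start, target, max_depth):
    """Breadth-first search from an event to a possible referent.

    Returns the chain of associations, or None if the target is not
    reachable within max_depth links.
    """
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            return path
        if len(path) - 1 >= max_depth:
            continue  # depth limit reached; do not expand further
        for neighbour in ASSOCIATIONS.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None


buyer_path = find_link("sell car", "buyer", max_depth=2)
escrow_path = find_link("sell house", "escrow company", max_depth=2)
corner_shallow = find_link("sell house", "corner", max_depth=2)
corner_deep = find_link("sell house", "corner", max_depth=4)
```

With a limit of two links the sketch resolves ‘the buyer’ and ‘the escrow
company’ but misses ‘the corner’, which is only found once the limit is
raised to four. A larger limit finds more referents but also expands more
of the graph, which is the control problem the discussion points to.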


Reference 

Allen, J. F. (1987). Natural Language Understanding. Menlo Park, CA:
Benjamin/Cummings.